Update turbomind modeling infrastructure #4557
Open
lzhangzz wants to merge 13 commits into InternLM:main from
Conversation
…t loading, and model loader 674 squashed commits:
- Reorganize turbomind directory structure
- Refactor weight loading to support heterogeneous weight data types
- Add a WeightFormat enum
- Replace BaseOutputModel/TextModelLoader with a unified ModelLoader
- Eliminate data_format threading from Linear
- Remove dead code
Contributor
Pull request overview
This PR refactors TurboMind’s modeling + conversion stack by replacing the legacy “deploy/source_model + config dataclasses” pipeline with a spec/builder-driven module system, adding a C++ module registry and new weight-module types, and updating engine/model code to consume the new weight tree.
Changes:
- Introduces a registry-backed C++ `core::Module` infrastructure (plus `DataFormat`) and new modular weight classes (Linear/Norm/Attention/FFN/MoE/DeltaNet/ModelRoot/ModelWeight).
- Reworks the Python-side TurboMind converter to use `TextModelSpec` plus builders and a model loader, and removes the legacy `lmdeploy.turbomind.deploy` pipeline.
- Updates engine/model runtime plumbing (TurboMind API, Engine/SequenceManager, llama layers) to use the new module/weight tree.
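The registry-backed module infrastructure described above can be sketched in Python. All names here (`register_module`, `create_module`, the registry dict) are illustrative stand-ins for the PR's C++ `core::Module` registry, not its actual API:

```python
# Minimal sketch of a name-keyed module registry, analogous in spirit to the
# C++ module registry this PR adds. Names are illustrative, not the PR's API.
_REGISTRY = {}

def register_module(name):
    """Class decorator that records a module type under `name`."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

def create_module(name, **kwargs):
    """Instantiate a registered module type by name."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f'unknown module type: {name!r}') from None
    return cls(**kwargs)

@register_module('norm')
class NormWeight:
    def __init__(self, dim):
        self.dim = dim
```

A registry like this is what makes the `--whole-archive` link flag in `src/turbomind/python/CMakeLists.txt` necessary on the C++ side: static registrar objects must not be dropped by the linker, or their module types never get registered.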
Reviewed changes
Copilot reviewed 131 out of 131 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_converter.py | Removes legacy converter tests; leaves a remaining test that still references removed legacy modules. |
| tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py | Adjusts compressed-tensors tests but still imports removed legacy deploy modules. |
| tests/test_lmdeploy/test_converter.py | Adds tests for _deep_merge plus a logging capture fixture. |
| src/turbomind/utils/memory_utils.h | Declares dtype-cast kernel + in-place ensure-float-dtype helper. |
| src/turbomind/utils/memory_utils.cu | Implements dtype casting and EnsureFloatDtype. |
| src/turbomind/turbomind.h | Updates TurboMind API to accept EngineConfig and expose module roots + TP ranks. |
| src/turbomind/python/CMakeLists.txt | Ensures static registrars are linked into the Python extension via --whole-archive. |
| src/turbomind/models/output_processor.h | Refactors ctor signature to avoid ModelParam dependency. |
| src/turbomind/models/output_processor.cc | Implements updated OutputProcessor ctor signature. |
| src/turbomind/models/norm_weight.h | Adds a NormWeight module type. |
| src/turbomind/models/norm_weight.cc | Registers and prepares NormWeight (dtype ensure). |
| src/turbomind/models/moe_weight.h | Adds a modular MoeWeight definition/config. |
| src/turbomind/models/moe_weight.cc | Implements MoE expert linking into a fused block view. |
| src/turbomind/models/model_weight.h | Adds root ModelWeight module for full weight tree. |
| src/turbomind/models/model_weight.cc | Implements ModelWeight prepare/verify + derived metadata. |
| src/turbomind/models/model_root.h | Adds ModelRoot sentinel for stream/allocator ownership. |
| src/turbomind/models/model_root.cc | Implements ModelRoot runtime context + prepare checks. |
| src/turbomind/models/llama/unified_decoder.h | Updates decoder to consume ModelWeight/DecoderLayerWeight. |
| src/turbomind/models/llama/unified_attention_layer.h | Refactors attention layer to use new AttentionWeight and rope config. |
| src/turbomind/models/llama/moe_ffn_layer.h | Refactors MoE FFN layer to use MoeWeight. |
| src/turbomind/models/llama/llama_rope.h | Moves rope param helpers out (now in AttentionWeight impl). |
| src/turbomind/models/llama/llama_params.h | Replaces model/attn/moe params with EngineConfig-based EngineParam. |
| src/turbomind/models/llama/SequenceManager.h | Updates ctor signature to explicit scalar params (no ModelParam). |
| src/turbomind/models/llama/SequenceManager.cc | Implements updated SequenceManager state sizing and cache layout. |
| src/turbomind/models/llama/LlamaWeight.h | Removes old monolithic LlamaWeight. |
| src/turbomind/models/llama/LlamaWeight.cc | Removes old monolithic LlamaWeight implementation. |
| src/turbomind/models/llama/LlamaLinear.h | Switches linear ops to new LinearWeight. |
| src/turbomind/models/llama/LlamaLinear.cu | Implements GEMM path using LinearWeight formats/descriptors. |
| src/turbomind/models/llama/LlamaFfnLayer.h | Refactors FFN layer to consume FfnWeight. |
| src/turbomind/models/llama/LlamaFfnLayer.cc | Updates FFN forward path for new weight module layout. |
| src/turbomind/models/llama/LlamaDenseWeight.h | Removes old dense/attention/ffn weight structs. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.h | Removes old llama-specific decoder layer weight. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Removes old llama-specific decoder layer weight impl. |
| src/turbomind/models/llama/GatedDeltaNetWeight.h | Removes old DeltaNet weight module. |
| src/turbomind/models/llama/GatedDeltaNetWeight.cc | Removes old DeltaNet weight module impl. |
| src/turbomind/models/llama/GatedDeltaNetLayer.h | Updates GDN layer to consume DeltaNetWeight. |
| src/turbomind/models/llama/CMakeLists.txt | Adjusts llama static lib sources (legacy pieces removed). |
| src/turbomind/models/linear_weight.h | Adds new LinearWeight module and format helpers. |
| src/turbomind/models/language_model.h | Switches LanguageModel to accept ModelWeight. |
| src/turbomind/models/input_processor.h | Refactors ctor to avoid ModelParam dependency. |
| src/turbomind/models/input_processor.cc | Implements updated ctor; allocates embed buffers from explicit dims/dtype. |
| src/turbomind/models/ffn_weight.h | Adds FfnWeight module and config. |
| src/turbomind/models/ffn_weight.cc | Implements FfnWeight::prepare (epilogue + grouped flag propagation). |
| src/turbomind/models/delta_net_weight.h | Adds DeltaNetWeight module and config. |
| src/turbomind/models/delta_net_weight.cc | Implements DeltaNetWeight::prepare dtype enforcement. |
| src/turbomind/models/decoder_layer_weight.h | Adds architecture-independent DecoderLayerWeight composite. |
| src/turbomind/models/decoder_layer_weight.cc | Implements verify rules and registers the module. |
| src/turbomind/models/attention_weight.h | Adds AttentionWeight module and embedded RopeConfig. |
| src/turbomind/models/attention_weight.cc | Implements rope kernel param init and registers AttentionWeight. |
| src/turbomind/models/CMakeLists.txt | Adds new module sources to models library; removes legacy llama weight sources. |
| src/turbomind/kernels/quantization.cu | Makes QuantizeSymm dtype-dispatched. |
| src/turbomind/kernels/gemm/convert_v3.cu | Comment tweak for “no quantization” case. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Comments out legacy gemm test executables. |
| src/turbomind/engine/engine_config.h | Introduces EngineConfig struct (X-macro fields). |
| src/turbomind/engine/engine.h | Updates Engine ctor signature (now takes ModelWeight). |
| src/turbomind/engine/engine.cc | Refactors Engine to derive runtime fields from ModelWeight rather than ModelParam. |
| src/turbomind/core/test_data_format.cc | Adds Catch2 tests for DataFormat/ResolveLinearWeightFormat. |
| src/turbomind/core/registry.h | Adds module type registry + registration macro. |
| src/turbomind/core/registry.cc | Implements module registry. |
| src/turbomind/core/module.cc | Rewrites module base + ModuleList implementation and hooks up registry-based creation. |
| src/turbomind/core/data_format.h | Adds DataFormat + quant-param descriptors and helpers. |
| src/turbomind/core/data_format.cc | Implements DataFormat logic and ResolveLinearWeightFormat. |
| src/turbomind/core/CMakeLists.txt | Builds new core sources + adds data_format test. |
| src/turbomind/CMakeLists.txt | Adjusts turbomind link libs (removes yaml-cpp). |
| scripts/test_turbomind_model.py | Adds a CLI smoke-test script for TurboMind models. |
| lmdeploy/turbomind/supported_models.py | Narrows/updates supported arch mapping and simplifies checks. |
| lmdeploy/turbomind/spec.py | Adds TextModelSpec base (HF parsing → C++ configs + weight commits). |
| lmdeploy/turbomind/models/base.py | Introduces new INPUT_MODELS registry for spec classes. |
| lmdeploy/turbomind/models/__init__.py | Imports/registers available specs. |
| lmdeploy/turbomind/model_loader.py | Adds ModelLoader to bind runtime handles and load weights into TM. |
| lmdeploy/turbomind/loader.py | Adds all_items() API to loaders for spec-driven loading. |
| lmdeploy/turbomind/linear.py | Adds Linear bundle type and padding/concat helpers. |
| lmdeploy/turbomind/deploy/target_model/fp.py | Removes legacy deploy output model stub. |
| lmdeploy/turbomind/deploy/target_model/__init__.py | Removes legacy deploy target_model exports. |
| lmdeploy/turbomind/deploy/source_model/xcomposer2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/molmo.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/mixtral.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/minicpmv.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/llava.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internvl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internlm2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/gpt_oss.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek_vl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/base.py | Removes legacy deploy registries/base classes. |
| lmdeploy/turbomind/deploy/source_model/baichuan.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/__init__.py | Removes legacy deploy source_model imports. |
| lmdeploy/turbomind/deploy/policy.py | Removes legacy tensor processing policy helpers. |
| lmdeploy/turbomind/deploy/parameter.py | Removes legacy parameter export utilities. |
| lmdeploy/turbomind/deploy/config.py | Removes legacy turbomind model config dataclasses. |
| lmdeploy/turbomind/deploy/__init__.py | Removes legacy deploy package init. |
| lmdeploy/turbomind/builders/norm.py | Adds builder for Norm module commits. |
| lmdeploy/turbomind/builders/moe.py | Adds builder for MoE non-expert params and gate commits. |
| lmdeploy/turbomind/builders/module_list.py | Adds builder for ModuleList container commits. |
| lmdeploy/turbomind/builders/mla.py | Adds MLA fold/pad pipeline + builder. |
| lmdeploy/turbomind/builders/deltanet.py | Adds DeltaNet fusion helpers + builder. |
| lmdeploy/turbomind/builders/decoder_layer.py | Adds a decoder-layer container builder. |
| lmdeploy/turbomind/builders/attention.py | Adds attention fusion pipeline + builder. |
| lmdeploy/turbomind/builders/__init__.py | Exposes builder APIs. |
| lmdeploy/messages.py | Changes Response.__repr__ formatting. |
| lmdeploy/archs.py | Changes ImportError handling in backend auto-selection. |
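The new `tests/test_lmdeploy/test_converter.py` covers a `_deep_merge` helper. A recursive dict merge of that general shape might look like the following sketch (the name and exact semantics are assumptions, not the PR's implementation):

```python
def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key-by-key; any other value in `override`
    replaces the corresponding value in `base`. Neither input is mutated.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

A helper like this is the usual way to layer a partial user config over a full default config without clobbering whole nested sections.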
- Move dequant/transform utilities from _base.py into linear.py as the canonical home for all Linear operations
- Unify _ensure_compatible_formats and dequant_mixed into a single dequant_mixed function that triggers on any format diversity
- Drop the 'Spec' suffix from all turbomind model classes and files (TextModelSpec → TextModel, Qwen3TextSpec → Qwen3TextModel, etc.)
- Extract TextModelBuilder from _base.py into builders/text_model.py
- Move model-specific qk_norm from TextModel to Qwen3 and Qwen3.5
- Fix .gitignore typo (trubomind → turbomind)
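The "triggers on any format diversity" behavior of the unified `dequant_mixed` can be sketched as follows. The function shape and the `dequant` callback are illustrative assumptions, not the PR's actual signatures:

```python
def dequant_mixed(tensors, formats, dequant):
    """Sketch: if a weight bundle mixes formats, dequantize everything to
    a common floating format; a homogeneous bundle passes through untouched.

    `dequant` is a caller-supplied callback (tensor, format) -> tensor.
    """
    if len(set(formats)) <= 1:
        return list(tensors)  # all the same format: nothing to do
    return [dequant(t, f) for t, f in zip(tensors, formats)]
```

The point of triggering on *any* diversity, rather than on specific format pairs, is that downstream fusion (concatenating q/k/v or gate/up projections) only has to handle one case: either all inputs share a format, or all have been lowered to floats.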
Collaborator
For the /nvme4/huggingface_hub/hub/models--Qwen--Qwen3.5-2B/snapshots/15852e8c16360a2fea060d615a32b45270f8a8fc/ model, the results differ from those of the main branch.

[Attached output comparison: this branch vs. main branch]
…anup Align all turbomind source models with Qwen3 conventions:
- Drop engine_cfg from model signatures, wire data_type via Context
- Add Context, ParallelGroup, make_moe_config, make_mla_config helpers
- Collapse make_*_config functions by removing per-function data_type
- Remove dead fields from C++ configs (has_bias, hidden_dim, etc.)
- Remove _layer_pattern, _embed_key, _norm_key from all models
- Unify FFN padding with group-based pad/round_up helpers
- Add TP padding for block-quantized formats and GEMM K-alignment
- Remove dead code: _pad_1d, _norm, pad_in_dim, _softmax_scale
- Add InternVL3.5, InternLM2/3, Llama turbomind support
- Rename fused_moe to is_expert, align Python/C++ config fields
- Use direct HF config access, Transformers type hints, all-params loader
- Clean up imports, docstrings, formatting across all model files
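The group-based pad/round_up helpers mentioned above can be sketched like this. Both function names and the exact padding rule are assumptions for illustration; the intent is that each tensor-parallel shard ends up holding a whole number of quantization groups:

```python
def round_up(x, align):
    """Round x up to the nearest multiple of `align`."""
    return (x + align - 1) // align * align

def pad_to_group(dim, group_size, tp):
    """Sketch: pad a weight dimension so that, after splitting across `tp`
    ranks, each shard covers complete quantization groups of `group_size`.
    """
    return round_up(dim, group_size * tp)
```

For example, a hidden dimension of 100 with 32-wide quant groups and TP=2 would be padded to 128, so each rank gets 64 columns (two whole groups) instead of a group straddling the shard boundary.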
The raise in archs.py was a debugging leftover. The repr in messages.py needs !r to properly escape control characters.
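The `!r` point is easy to demonstrate with a minimal sketch (this `Response` class is a simplified stand-in, not lmdeploy's actual one):

```python
class Response:
    """Simplified stand-in to show why repr formatting needs !r."""
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # {self.text!r} applies repr() to the field, so control characters
        # like newlines are shown as escapes instead of being rendered,
        # keeping the whole repr on one unambiguous line.
        return f'Response(text={self.text!r})'
```

Without `!r`, a response containing `\n` or `\r` would break the repr across lines (or let embedded quotes masquerade as delimiters), which matters anywhere reprs end up in logs.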
Replace turbomind's flat string-prefix model loading with a typed Checkpoint (storage backend) and Prefix (path navigation) pair, making source models stateless and decoupling topology from storage.

Core infrastructure:
- Add a Checkpoint ABC with SafetensorsCheckpoint and PytorchCheckpoint subclasses
- Add Prefix for typed checkpoint path navigation (+, slices, get, pop)
- Add .cuda() device transfer and .pop() for single-use weights to Checkpoint
- Replace Prefix.chunks with Prefix.slices for cleaner layer iteration
- Thread the index parameter through the resolver, dropping post-hoc indexing

Architecture migration:
- Bridge ModelLoader.export to Checkpoint/Prefix via a per-class _uses_prefix flag
- Make WeightFormatResolver accept Prefix alongside the legacy (params, prefix) tuple
- Migrate all 8 architectures to Checkpoint/Prefix: llama, qwen2, qwen3, internvl3_5, internlm2, glm4_moe_lite, qwen3_5, gpt_oss
- Drop legacy dict-based loading after migration
- Inline shard walking into checkpoint.py and strip loader.py
- Remove the layer_progress helper and unused imports

Norm refactoring:
- Change norm() to accept Prefix instead of a raw tensor
- Use pop() for single-use norm weights across all architectures
- Update qwen3_5 norm() calls to use Prefix with transforms

Packed expert index:
- Add an index parameter to Checkpoint.get/pop for packed expert weight access
- Fix get() vs pop() semantics for packed expert weights

Style:
- Reflow long lines, fix lint violations (UP037, imports), align whitespace
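The Prefix half of the pair can be sketched as follows. The method set (`+` for appending a segment, `slices` for iterating indexed children) mirrors the description above, but the implementation details are illustrative assumptions, not the PR's code:

```python
class Prefix:
    """Sketch of typed checkpoint path navigation.

    `+` appends a path segment and returns a new Prefix (immutable),
    and slices(n) yields one child Prefix per indexed element, e.g.
    one per decoder layer. Illustrative only.
    """
    def __init__(self, parts=()):
        self.parts = tuple(parts)

    def __add__(self, segment):
        return Prefix(self.parts + (str(segment),))

    def slices(self, n):
        for i in range(n):
            yield self + i

    def __str__(self):
        return '.'.join(self.parts)
```

Making prefixes immutable values rather than mutable strings is what lets source models stay stateless: a model describes its topology as prefix arithmetic, and the Checkpoint alone knows how those paths map onto safetensors or pickle shards.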
- Resolve conflicts by keeping our refactored architecture
- Thread trust_remote_code through from_pretrained → __init__ → _from_hf → is_supported → get_tm_config → get_model_arch
- Add an is_cublas_grouped check to _should_fuse_silu: disable fused SiLU for bf16 MoE on SM100+ GPUs (CublasGroupedKernel limitation)
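The fused-SiLU gating described in the last bullet amounts to a capability check; a sketch under assumed names and conditions (the real `_should_fuse_silu` takes different inputs):

```python
def should_fuse_silu(dtype, is_moe, sm_version, is_cublas_grouped):
    """Sketch of the gating described above: fused SiLU is disabled for
    bf16 MoE on SM100+ when the cuBLAS grouped-GEMM path is in use.
    Names and exact conditions are illustrative assumptions.
    """
    if is_moe and dtype == 'bfloat16' and sm_version >= 100 and is_cublas_grouped:
        return False  # CublasGroupedKernel limitation on this path
    return True
```

Dense layers, non-bf16 dtypes, and pre-SM100 GPUs keep the fused activation in this sketch; only the one problematic combination falls back.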